Using Unicode with MIME 


Status of this Memo 


This memo defines an Experimental Protocol for the Internet community. This memo does 
not specify an Internet standard of any kind. Distribution of this memo is unlimited. 


Abstract 


The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993(E) jointly define a 16-bit 
character set (hereafter referred to as Unicode) which encompasses most of the world’s 
writing systems. However, Internet mail (STD 11, RFC 822) currently supports only 7-bit 
US-ASCII as a character set. MIME (RFC 1521 and RFC 1522) extends Internet mail to 
support different media types and character sets, and thus could support Unicode in mail 
messages. MIME neither defines Unicode as a permitted character set nor specifies how it 
would be encoded, although it does provide for the registration of additional character sets 
over time. 


This document specifies the usage of Unicode within MIME. 


Motivation 


Since Unicode is starting to see widespread commercial adoption, users will want a way to 
transmit information in this character set in mail messages and other Internet media. Since 
MIME was expressly designed to allow such extensions and is on the standards track for the 
Internet, it is the most appropriate means for encoding Unicode. RFC 1521 and RFC 1522 do 
not define Unicode as an allowed character set, but allow registration of additional character 
sets. 


In addition to allowing use of Unicode within MIME bodies, another goal is to specify a way 
of using Unicode that allows text which consists largely, but not entirely, of US-ASCII 
characters to be represented in a way that can be read by mail clients that do not understand 
Unicode. This is in keeping with the philosophy of MIME. Such an encoding is described in 
another document, “UTF-7: A Mail Safe Transformation Format of Unicode” [UTF-7]. 


Overview 


Several ways of using Unicode are possible. This document specifies both guidelines for use 
of Unicode within MIME, and a specific usage. The usage specified in this document is a 
straightforward use of Unicode as specified in “The Unicode Standard, Version 1.1”. 


This encoding is intended for situations where sender and recipient do not want to do a lot of 
processing, when the text does not consist primarily of characters from the US-ASCII 
character set, or when sender and receiver are known in advance to support Unicode. 


Another encoding is intended for situations where the text consists primarily of US-ASCII, 
with occasional characters from other parts of Unicode. This encoding allows the US-ASCII 
portion to be read by all recipients without having to support Unicode. This encoding is 
specified in another document, “UTF-7: A Mail Safe Transformation Format of Unicode” 
[UTF-7]. 


Finally, in keeping with the principles set forth in RFC 1521, text which can be represented 
using the US-ASCII or ISO-8859-x character sets should be so represented where possible, 
for maximum interoperability. 


Definitions 
The definition of character set Unicode: 


The 16-bit character set Unicode is defined by “The Unicode Standard, Version 1.1”. 
This character set is identical with the character repertoire and coding of the international 
standard ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2; Subset=300; 
Implementation Level=3. 


Note. Unicode 1.1 further specifies the use and interaction of these character codes 
beyond the ISO standard. However, any valid 10646 BMP (Basic Multilingual Plane) 
sequence is a valid Unicode sequence, and vice versa; Unicode supplies interpretations of 
sequences on which the ISO standard is silent as to interpretation. 


This character set is encoded as sequences of octets, two per 16-bit character, with the 
most significant octet first. Text with an odd number of octets is ill-formed. 


Rationale. ISO/IEC 10646-1:1993(E) specifies that when characters in the UCS-2 form 
are serialized as octets, the most significant octet appears first. This is also in keeping 
with the common network practice of choosing a canonical format for transmission. 
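

The serialization described above can be sketched in Python as follows. This is an illustration only and not part of the specification: the function names encode_unicode_1_1 and decode_unicode_1_1 are hypothetical, and the sketch assumes that every character fits in 16 bits.

```python
import struct


def encode_unicode_1_1(text: str) -> bytes:
    """Serialize text as two octets per character, most significant octet first."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            # Assumption of this sketch: only 16-bit character codes are allowed.
            raise ValueError(f"U+{code:04X} does not fit in 16 bits")
        out += struct.pack(">H", code)  # ">H" = big-endian unsigned 16-bit
    return bytes(out)


def decode_unicode_1_1(octets: bytes) -> str:
    """Rebuild text from big-endian octet pairs; reject odd-length input."""
    if len(octets) % 2 != 0:
        raise ValueError("ill-formed text: odd number of octets")
    count = len(octets) // 2
    codes = struct.unpack(f">{count}H", octets)
    return "".join(chr(code) for code in codes)
```

A decoder written this way rejects ill-formed input before interpreting any octet, which keeps the odd-length check separate from character handling.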


General Specification of Unicode Character Sets Within MIME 


The Unicode Standard is currently at version 1.1. Although new versions should be 
compatible with old implementations if an implementation is compliant with the standard, 
some implementations may choose to check the version of the character set that is being 
used. In order to allow some implementations to check the version number and allow others 
to ignore it, all registrations of Unicode variants and versions for MIME usage should have 
MIME charset names which conform to one of the two following patterns: 


UNICODE-major-minor 
UNICODE-major-minor-variant 


Where major and minor are strings of decimal digits (0 through 9) specifying the major and 
minor version number of the Unicode standard to which the text in question conforms. In the 
interests of interoperability, the lowest version number compatible with the text should be 
used. The lowest acceptable version number is UNICODE-1-1, corresponding to “The 
Unicode Standard, Version 1.1”. The optional trailing string “variant” describes the 
particular transformation format of Unicode that the registration describes; its content is up 
to the particular registration. If there is no trailing variant string, the charset name refers to 
the basic two octet form of Unicode as described in “The Unicode Standard”. 


Example. A hypothetical charset which referred to the UTF-8 transformation format of 
Unicode/10646 (also known as UTF-2 or UTF-FSS) might be named 
UNICODE-1-1-UTF-8. 
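

An implementation that wishes to check the version number might parse charset names matching the two patterns above along the following lines. This is a sketch under stated assumptions: the names UnicodeCharset, parse_unicode_charset and meets_minimum_version are hypothetical, and matching is case-insensitive on the assumption that MIME charset values are not case sensitive (RFC 1521).

```python
import re
from typing import NamedTuple, Optional

# UNICODE-major-minor, optionally followed by "-variant".
# re.ASCII restricts \d to the decimal digits 0 through 9.
_CHARSET_RE = re.compile(r"UNICODE-(\d+)-(\d+)(?:-(.+))?", re.IGNORECASE | re.ASCII)


class UnicodeCharset(NamedTuple):
    major: int
    minor: int
    variant: Optional[str]  # None means the basic two-octet form


def parse_unicode_charset(name: str) -> Optional[UnicodeCharset]:
    """Return the parsed name, or None if it does not follow either pattern."""
    match = _CHARSET_RE.fullmatch(name.strip())
    if match is None:
        return None
    major, minor, variant = match.groups()
    return UnicodeCharset(int(major), int(minor), variant)


def meets_minimum_version(charset: UnicodeCharset) -> bool:
    """UNICODE-1-1 is the lowest acceptable version."""
    return (charset.major, charset.minor) >= (1, 1)
```

An implementation that ignores the version can simply discard the major and minor fields and dispatch on the variant alone.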


Encoding Character Set Unicode Within MIME 


Character set Unicode uses 16-bit characters, and therefore would normally be used with the 
Binary or Base64 content transfer encodings of MIME. In header fields, it would normally 
be used with the B content transfer encoding. The MIME character set identifier is 
UNICODE-1-1. 
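

For header fields, the B encoding might be applied as in the sketch below. It assumes the encoded-word syntax of RFC 1522 (=?charset?encoding?encoded-text?=); the function name encoded_word is hypothetical, and the sketch does not enforce the length limit that RFC 1522 places on an encoded-word.

```python
import base64
import struct


def encoded_word(text: str) -> str:
    """Build a B-encoded word for text in charset UNICODE-1-1."""
    # Two octets per character, most significant octet first.
    # struct.pack raises struct.error for codes above 0xFFFF.
    octets = b"".join(struct.pack(">H", ord(ch)) for ch in text)
    payload = base64.b64encode(octets).decode("ascii")
    return f"=?UNICODE-1-1?B?{payload}?="
```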


Example. Here is a text portion of a MIME message containing the Japanese word 
“nihongo” (hexadecimal 65E5,672C,8A9E) written in Han characters. 


Content-Type: text/plain; charset=UNICODE-1-1 
Content-Transfer-Encoding: base64 


ZeVnLIqe 


Example. Here is a text portion of a MIME message containing the Unicode sequence 
“A<NOT IDENTICAL TO><ALPHA>.” (hexadecimal 0041,2262,0391,002E). 


Content-Type: text/plain; charset=UNICODE-1-1 
Content-Transfer-Encoding: base64 


AEEiYgORAC4= 
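

The Base64 bodies in the two examples above can be reproduced with a short Python sketch. The function names unicode_1_1_body and text_part are hypothetical. The sketch assumes 16-bit characters only and does not wrap long Base64 output into lines, which a full implementation would need to do.

```python
import base64


def unicode_1_1_body(text: str) -> str:
    """Base64-encode text serialized as big-endian 16-bit characters."""
    # int.to_bytes raises OverflowError for codes above 0xFFFF.
    octets = b"".join(ord(ch).to_bytes(2, "big") for ch in text)
    return base64.b64encode(octets).decode("ascii")


def text_part(text: str) -> str:
    """Assemble the headers and body of a text/plain part (unwrapped body)."""
    return (
        "Content-Type: text/plain; charset=UNICODE-1-1\r\n"
        "Content-Transfer-Encoding: base64\r\n"
        "\r\n"
        + unicode_1_1_body(text)
        + "\r\n"
    )


print(unicode_1_1_body("\u65e5\u672c\u8a9e"))  # hexadecimal 65E5,672C,8A9E
print(unicode_1_1_body("A\u2262\u0391."))      # hexadecimal 0041,2262,0391,002E
```

Running the two print statements yields ZeVnLIqe and AEEiYgORAC4=, respectively.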


Security Considerations 
Security issues are not discussed in this memo. 